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Abstract: This paper deals with some computational aspects in the Bayesian analysis 

of statistical models with intractable normalizing constants. In the presence of intractable 
fvj , normalizing constants in the likelihood function, traditional MCMC methods cannot be 

applied. We propose an approach to sample from such posterior distributions. The method 
^^ , can be thought as a Bayesian version of the MCMC-MLE approach of 8] . To the best of our 

knowledge, this is the first general and asymptotically consistent Monte Carlo method for 
j^ ■ such problems. We illustrate the method with examples from image segmentation and social 

C/3 . network modeling. We study as well the asymptotic behavior of the algorithm and obtain a 

strong law of large numbers for empirical averages. 
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OO !• Introduction 

o' 

. ^^ Statistical inference for models with intractable normalizing constants poses a major computa- 

X 

Lj ■ tional challenge. This problem occurs in the statistical modeling of many scientific problems. 

c^; n n 

Examples include the analysis of spatial point processes ([13]), image analysis ([10|]), protein de- 
sign ([Hi) and many others. The problem can be described as follows. Suppose we have a dataset 
xq E {^,B) generated from a statistical model e^^''\{dx)/Z{0) with parameter € (0,S), 
where the normalizing constant Z{9) = ^p^e^^' '\{dx) depends on 9 and is not available in 
closed form. Let jjl be the prior density of the parameter 6 G {&, H). The posterior distribution of 
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/Bayesian computation for intractable normalizing constants 2 

9 given xq is then given by 

AO) oc ^e^(-O'^)^(0). (1) 

When Z[9) cannot be easily evaluated, Monte Carlo simulation from this posterior distribution 
is problematic even using Markov Chain Monte Carlo (MCMC). [ij] uses the term doubly in- 
tractable distribution to refer to posterior distributions of the form ([T|). Current Monte Carlo 
sampling methods do not allow one to deal with such models in a Bayesian framework. For ex- 
ample, a Metropolis-Hastings algorithm with proposal kernel Q and target distribution vr, would 
have acceptance ratio min f 1 , ^ E(xQ,e) 'zW) 0(9 d') ] which cannot be computed as it involves the 
intractable normalizing constant Z evaluated at 6 and 9'. 

An early attempt to deal with this problem is the pseudo-likelihood approximation of Besag 



£ 



([J]) which approximates the model e '^'^ by a more tractable model. Pseudo- likelihood inference 
provides a first approximation but typically performs poorly (see e.g. jSl]). Maximum likelihood 
inference is possible. MCMC-MLE, a maximum likelihood approach using MCMC has been de- 
veloped in the 90's (71,^]). Another related approach to find MLE estimates is Younes' algorithm 



18l |) based on stochasti c ap proximation. An interesting simulation study comparing these three 



methods is presented in 



m- 



Comparatively little work has been done to develop asymptotically exact methods for the 
Bayesian approach to this problem. But various approximate algorithms exist in the literature, 
often based on path sampling ( a] ) . Recently, [I4] have shown that if exact sampling of X from 
e^^^'^> /Z{9) (as a density in {XJ3)) is possible then a valid MCMC algorithm to sample from 



([T|) can be developed. See also 



14( 1 for some improvements. Their approach uses a clever auxiliary 



variable algorithm. But intractable normalizing constants often occur in models for which exact 
sampling of X is not possible or is very expensive. Another recent development to the problem is 
the approximate Bayesian computation schemes of Plagnol-Tavare ([155) t)^^ which sample only 
approximately from the posterior distribution. 

In this paper, we propose an adaptive Monte Carlo approach to sample from ([T]). Our algorithm 
generates a stochastic process (not Markov in general) {(X„,0n), n > 0} in <Y x such that as 
n — ;• 00, the marginal distribution of 9^ converges to ([T]). It is clear that any method to sample 
from ([1]) will have to deal with the intractable normalizing constant Z{9). In the auxiliary variable 
method of [l^], computing Z{9) is replaced in a sense by perfect sampling from e^^^'^' /Z{9). This 
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/Bayesian computation for intractable normalizing constants 3 

strategy works well so long as perfect sampling is feasible and inexpensive. In the present work, 
we take another approach building on the idea of estimating the entire function Z from a single 
Monte Carlo sampler. The starting point of the method is importance sampling. Suppose that for 
some 61(0) g 6, we can sample (perhaps by MCMC) from the density e^(^''^*'"V^(6'^°^) in (X,B). 
Usin, this sample, we can certainly estimate Z(.)/Z(.(»)) fot any . 9. This is the same idea 
behind the MCMC-MLE algorithm of [8]. But it is well known that these estimates are typically 
very poor as 9 gets far from 0^^'. Now, suppose that instead of a single point 9^^\ we generate 
a population {9^^\ i = 1, . . . , d} in and that we can sample from h*{x,i) oc e^^^'^ ' ' /Z^O^"^') 
on Af X {1, . . . ,(i}. Then we show that in principle, efficient estimation for Z[0) is possible for 
any E B. Building on [3'] and the ideas sketched above, we propose an algorithm that generates 
a random process {(X„,0„), n > 0} such that the marginal distribution of X„ converges to A* 
and the marginal distribution of On converges to ([1]). This random process is not a Markov chain 
in general but we show (from first principle) that {On} has limiting distribution vr and satisfies a 
strong law of large numbers. 

The paper is organized as follows. A full description of the method including practical im- 
plementation details is given in Section [2j We illustrate the algorithm with three examples. The 
Ising model, a Bayesian image segmentation example and a Bayesian modeling of social networks. 
The examples are presented in Section [H Some theoretical aspects of the method are discussed 
in Section [3] with the proofs postponed to El 

2. Sampling from posterior distributions w^ith intractable normalizing constants 

Throughout, we fix the sample space {X,B^X) and the parameter space (6,H). The problem of 
interest is to sample from the posterior distribution ([1]) with 

Z{0) = I e^^''''^h{dx). (2) 



IX 

Let {0^^' , i = 1, . . . ,d} be a sequence of d points in Q. Let A* be the probability measure on 
<Y X {1, . . . , d} given by: 

A*(x,i) = ^^pyy, xex,ie{l,...,d}. (3) 
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/Bayesian computation for intractable normalizing constants 4 

Let ti{6, 6') be a similarity kernel on x such that X]f=i ^(^i ^'•*'') = 1 foi" ^-H ^ € 0- The starting 
point of the algorithm is the following decomposition of the partition function: 



Jx 



d 



i=l ■'^ 



^k(0,0«) / e^(-'^)-^(-'^'"')e^(^'^^'')A(dx 



J2 [ ^*ix,i)hgix,i)Xidx), (4) 



where 



he{x, i) = dKi9, 0«)^(0«)ei?(-,e)-i?(x,eW)_ ^5) 

The interest of the decomposition ^ is that {Z{6^^')} and A* do not depend on 9. Therefore, 
using samples from A*, this decomposition gives an approach to estimate Z(6) for all 6* € 0. 
This estimate should be reliable provided 9 is close to at least one particle 9^^'. The problem of 
sa.nplin, f.o. p.ob.bil.ty .eas™ such as A- has been .ecen.l, cons.de.ed by bl building ou 
the Wang-Landau algorithm of [17|]. We follow and improve that approach here. The resulting 
estimate of Z{9) can then continuously be fed to a second Monte Carlo sampler that carries the 
simulation with respect to tt. This suggests an adaptive Monte Carlo sampler to sample from ([1]) 
which we develop next. 

For any c = (c(l), . . . , c{d)) G M , we define the following density function on <Y x {1, . . . , d}: 

A,(x,i)oce^(^''^*'')-^«. (6) 

With c = log(Z), Ac = A*. The reader should think of c as an estimate of z, with z{i) := 
log Z{9^'^>). The algorithm will adaptively adjust c such that the marginal distribution on {1, . . . , d} 
is approximately uniform. In which case, we should have c{i) = \ogZ{i). Let {7n} be a sequence 
of (possibly random) positive numbers. We propose a non-Markovian adaptive sampler that lives 
in A" X {1, . . . , d} X M'^ X G. We start from an initial state {Xq, Iq, cq, ^o) G ^ x {1, . . . , d} x M^ x 6, 
where cq € M is the initial estimate of z. For example, cq = 0. At time n, given (X„, I„, c„, 9n) we 
first generate Xn+i from Pi^(Xn, ■), where Pi is a transition kernel on (X, B) with invariant distri- 
bution e^^^'^ ' ' /Z{9^^>). Next, we generate In+i from the distribution on {1, . . . ,d} proportional 
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/Bayesian computation for intractable normalizing constants 5 

to e '^ "+!' ' )~^"(*). Then we update the current estimate of log(Z) to c„+i given by: 

g£;(x„+i,eW)-c„(j) 



In view of dH), we can estimate ^(^) by: 



d 



Zn+i{e) = Y.K{eM^)e'^-*^ 



(i) 



i=l 






(8) 



with the convention that 0/0 = 0. Finally, for any positive function ^ : ^ (0, oo), let Q(^ be a 
transition kernel on (0,H) with invariant distribution 

vrc(^) oc ^e^(--^)/.(0). (9) 

Given (X„+i, /„+i, c„+i, Z„+i, 6'„), we generate 9n+i from (5z„+i (6'n, ■), where Z„+i is the function 
defined by ([8]). 

The algorithm can be summarized as follows. 

Algorithm 2.1. . Let {Xo,Io, co,9q) E X x {1, . . . , d} xR'^ x Q be the initial state of the algorithm. 
Let {7n} be (a possibly random) sequence of positive numbers. At time n, given {Xn,In,Cn,9n)- 

1. Generate Xn+i from, Pj^{Xn,-) where Pi is any ergodic kernel on {X,B) with invariant dis- 

tribution e^^^'^^'^^ /Z{i). 

2. Generate /n+i by sampling from the distribution on{l, . . . ,d} proportional to e^(^n+i.^ ' )-c„(?) _ 

3. Gompute Cn+i, the new estimate of g using ^. 

4. Using the function Zn+i defined by (E^, generate 6n+i from Qz„+i{Gn, •)• 

Remark 2.1. 1. The algorithm can be seen as an MCMC-MCMC analog to the MCMC-MLE 
of [3]. Indeed, with d = 1, the decomposition ([4|) becomes 



z{e)/z{e^^^) =E 



E{e,x)-E{x,eW) 



where the expectation is taken with respect to the density e^^' ' /Z{6^^'). But as discussed 
in the introduction, when E{6,X) — E{X,9^'^') has a large variance, the resulting estimate 
is terribly poor. 
2. We introduce k, to serve as a smoothing factor so that the particles 9^"^' 's close to 6 contribute 
more to the estimation of Z{9). We expect this smoothing step to reduce the variance of 
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the overall estimate of Z{6). In the simulations we choose 



<0, 



i(j)> 



- 1 |[6)-e(')| 
e 2h2 II I 



1 



\0-0O) 



|2 



The value of the smoothing parameter h is set by trials and errors for each example. 

3. The implementation of the algorithm requires keeping track of all the samples Xk that are 
generated (Equation ([8])). .Y can be a very high-dimensional space and we are aware of the 
fact that in practice, this bookkeeping can significantly slow down the algorithm. But in 
many cases, the function E takes the form E{x^ 6) = X]z=i Si{x)9i for some real- valued func- 
tions Si. In these cases, we only need to keep track of the statistics {(5i(X„), . . . , SxiXn)) , n > 
0}. All the examples discussed in the paper fall in this latter category. 

4. As mentioned earlier, the update of {Xn, In, Cn) is essentially the Wang-Landau algorithm 
of [Sj] with the following important difference. [3|] propose to update c„ one component per 
iteration: 

Cn+l{i) = Cn{i) -h7nl{i}(^+l)- 

We improve on this scheme in ^ by Rao-Blackwellization where we integrate out In+i- 

5. As mentioned above, and we stress this again, this algorithm is not Markovian in any way. 
The process {(X„,/„,c„)} is not a Markov chain but a nonhomogeneous Markov chain if 
we let {jn} be a deterministic sequence. {9n}, the main random process of interest is not a 
Markov chain either. Nevertheless, the marginal distribution of 0„ will typically converge to 
vr. This is because, Qz„, the conditional distribution of 9n+i given jr„ converges to Qz as 
n ^ 00 and Qz is a kernel with invariant distribution tt. We make this precise by showing 
that a strong law of large numbers holds for additive functionals of {On}- 

We now discuss the choice of the parameters of the algorithm. 

2.1. Choosing d and the particles {6^'^'} 

We do not have any general approach in choosing d and j^^*'} but we give some guidelines. The 
general idea is that the particles j^*-*)} need to cover reasonably well the range of the density vr 
and be such that for any S 0, the density e^^' ' /Z{6) in X can be well approximated by at 
least one of the densities e^*-^'^ ' ' /Z{i). One possibility is to sample 9^"^' independently from the 
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/Bayesian computation for intractable normalizing constants 7 

prior distribution //, some tempered version of it or some other similar distribution. We follow 
this approach in the examples below. Another possibility is to use a grid of points in Q. The 
value of d, the number of particles, should depend on the size of Q. We need to choose d such 
that the distributions e^^^'^ ' ' /Z{i) (in X) overlap. Otherwise, estimating the constants {Z{i)} 
can be difficult. This implies that d should not be too small. In the simulation examples be! low 
we choose d between 100 and 500. 

2.2. Choosing the step-size {'Jn} 



It is shown in [3] that the recursion ([7]) can also be written as a stochastic approximation algorithm 
with step-size {"fn}, so that in theory, any positive sequence {7^} such that X^Tn = 00 and 
X]7n < 00 can be used. But the convergence of Cn to log Z is very sensitive to the choice {7^}- 
If the 7„'s are overly small, the recursive equation in ([7]) will make very small steps. But if these 
numbers are overly large, the algorithm will have a large variance. In both cases, the convergence 
to the solution will be slow. Overall, choosing the right step-size for a stochastic approximation 
algorithm is a difficult problem. Here we follow 3] which has elaborated on a heuristic approach 
to this problem originally proposed by [iTJ. 

The main idea of the method is that typically, the larger 7„, the easier it is for the algorithm 
to move around the state space. Therefore, at the beginning 70 is set at a relatively large value. 
This value is kept until {/„} has visited equally well all the points of {!,... ,d}. Let ti be the 
first time where the occupation measure of {1, . . . , d} by {In} is approximately uniform. Then we 
set 7ri+i to some smaller value (for example 7ti+i = 7ti/2) and the process is iterated until 7^ 
become sufficiently small. As which point we can choose to switch to a deterministic sequence of 
the form 7„ = n~^'^^^. Combining this idea with Algorithm 12.11 we get the following. 

Algorithm 2.2. . Let 7 > ei > 0, £2 > 6e constants and let {Xq,Iq,co,9q) be some arbitrary 
initial state of the algorithm. Set v = £ M. and n = 0. While 7 > ei and given Tn = 
(y{{Xk,Ik,Ck,9k), k < n}, 



1. Generate [Xn+i, In+i, Cn+i, On+i) as in Alaorithm \2.1[ 

2. For i = 1, . . . ,d: set v{i) = v{i) + li{In+i). 

3. // maxj v{i) — ^ ^ if ; then set 7 = 7/2 and set u = G 
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/Bayesian computation for intractable normalizing constants 8 

Remark 2.2. We use this algorithm in the examples below with the following specifications. We 
set the initial 7 to 1, £2 = 0.2. We run {(X„,/„,c„)} until 7 < ei = 0.001 before starting {0„} 
and switching to a deterministic sequence 7„ = ei/inP''^ . 

3. Convergence analysis 

In this section, we derive a law of large numbers under some verifiable conditions. The process 
of interest here is {On}- Let (il, J^, Pr) be the reference probability triplet. We equip (fi,^, Pr) 
with the filtration {.Fn}, where J^n = o"{(-^fe+i,-^A;+ii Cjt+i,6'fc), k < n}. Note that ^„ includes 
(X„+i, /„+!, c„+i) since these random variables are used in generating 6n+i- From the definition 
of the algorithm, we have: 

Pl{en+ieA\J^n) = Qz„+r{On,A), Pi -U.S.. (10) 

We see from (jlOp that {9n,^n} is an adaptive Monte Carlo algorithm with varying target distri- 
bution. In analyzing {On,J^n} here, we do not strive for the most general result but restrict ourself 
to conditions that can be easily checked in the examples considered in the paper. We assume that 
© is a compact subset of M"^, the q-dimensional Euclidean space equipped with its Borel cr-algebra 
and the Lebesgue measure. Firstly, we assume that the function E is bounded: 
(Al): There exist m,M €£M such that: 

m < E{x, e) <M, x£X,e gQ. (11) 

In many applications, and this is the case for the examples discussed below, A" is a finite set 
(typically very large) and is a compact set. In these cases, (Al) is easily checked. In order to 
proceed any further, we need some notations. A transition kernel on (X, B) operates on measurable 
real- valued functions / as Pf{x) = J P{x,dy)f{y), and the product of two transition kernels Pi 
and P2 is the transition kernel defined as PiP2{x,A) = J Pi{x,dy)P2{y,A). We can then define 
recursively P" = PP"^^, n > 1, P^(x,A) = lyi(x). For two probability measures /U,i/, the total 
variation distance between /i and u is defined as ||/i — i^Wrpy := sup^ l/^(^) ~ ^(^)l- We say that 
a transition kernel P is geometrically ergodic if P is (/>-irreducible, aperiodic and has an invariant 
distribution vr such that: 

||P"(x,-)-vr||^y <M(a;)p", n>0 
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/Bayesian computation for intractable normalizing constants 9 

for some p € (0, 1) and some function M : Af — > (0, oo]. 

Our next assumption involves the transition kernel Q(^. 
(A2): For (^ : — > (0, oo), Q^ is a Metropolis kernel with invariant distribution vr^ and proposal 
kernel density p. There exist e > and an integer uq > 1 such that for all 9, 9' € ©.' 

p"«(e,0')>e- (12) 



Remark 3.1. 1. The condition (J12p clearly holds for most symmetric proposal kernels p{9, 9'), 
provided that p{9, 9') remains bounded away from on some ball centered at 9. 
2. (|12p often implies that Q(^ is uniformly ergodic: 

QdO^A) > y'min('l,e^(^°'^')-^(-°'^)^)p(0,0')'i^' 
> e— *^ inf (^) f p{9,9')d9'. 
Therefore, provided miefi'eB (zm) > 0> if dUD hold then Q""((9,^) > e' iiLeb{A) for some 



SW)) ' ^^^ ^C 

e' >0. 

(A3): {7n} is a random sequence, adapted to {Tn} such that 7n > 0, X^Tn = oo and X^Tn < °° 
Pr-a.s. 

Theorem 3.1. Assume (Al)-(A3). Assume also that each kernel Pi on {X,B) is geometrically 

ergodic. Let h : (B, S) ^ M such that \h\ < 1. Then: 

1 " 

-Y.h{Ok)^Ah),'P^-a.s. (13) 

Proof. See Section [6l D 

4. Examples 

4.-/. Ising model 

We test the algorithm with the Ising model on a rectangular lattice. This is a simulated example. 
The model is given by e ^^' /Z{9) where 

m n—1 m—1 n 

^(^) =^Y1 ^ij^i,j+'^ + XI X! XijXi+ij, (14) 

i=l j=l i=l j=l 
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/Bayesian computation for intractable normalizing constants 10 

and Xi € {1, —1}- We use m = n = 64. We generate the data xq from e^^^^' /Z{9) with 6 = 0.40 
by perfect samphng using the Propp- Wilson algorithm. Using xq and postulating the model 
e^^^'-^yz{9), we would like to infer on 9. We use the prior fi{9) = l(o,i)(6'). We set d = 100 
and generate the points {9^'^'} independently and uniformly in (0,1). As described in Section 
12.21 we use the flat histogram approach in selecting {jn} with an initial value 70 = 1, until 7„ 
becomes smaller than 0.001. Then we start feeding the adaptive chain {9n} which is run for 10, 000 
iterations. In updating 9n, we use a Random Walk Metropolis sampler with proposal distribution 
U{9n — b,9n + b) (with reflexion at the boundaries) for some 6 > 0. We adaptively update b so as 
to reach an acceptance rate of 30% (see e.g. [2])- We d! iscard th! e first 1, 999 points as a burn-in 
period. The results are plotted on Figure 1. As we can see from these plots, the sampler appears 
to have converged to the posterior distribution vr. The mixing rate of the algorithm as inferred 
from the autocorrelation function seems fairly good. In addition, the algorithm yields an estimate 
of the partition function logZ(9) which can be re-used in other sampling problems. 



(a) 



(b) 





2000 4000 6000 8000 



(c) 



(d) 




0.38 0.39 0.40 0.41 




Figure 1: Output for the Ising model 9 = 0.40, m 



n 



64. (a): estimation oi\ogZ{9) up to an 



additive constant; (b)-(d): Trace plot, histogram and autocorrelation function of the adaptive 

sampler {9n\- 
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^.■2. An application to image segmentation 

We use the Ising model above to illustrate an application of the methodology to image segmen- 
tation. In image segmentation, the goal is the reconstruction of images from noisy observations 



(see e.g. [9|, llO|). We represent the image by a vector x = {xj, i G 5} where 5 is a m, x n lattice 
and Xi € {1, . . . ,K}. Each i ^ S represents a pixel, and Xi is the color of the pixel i. K is the 
number of colors. Here we assume that K = 2 and Xi € {—1, 1} is either black or white. In the 
image segmentation problem, we do not observe x but a noisy approximation y. We assume that: 

yi\x,o ~ A/(xi,(T ), (15) 

for some unknown parameter o"^. Even though (J15p is a continuous model, it has been shown 
to provide a relatively good framework for image segmentation problems with multiple additive 
sources of noise ( |10] ) . 

We assume that the true image x is generated from an Ising model (see Section 14. Ih with 
interaction parameter d. We assume that Q follows a uniform prior distribution on (0,1) and 
that o"^ has a prior distribution that is proportional to l/o"^l(o,oo)('^^)- The posterior distribution 
{0,a'^,x) is then given by: 



v^(^,a^x|y) oc (i^) ' ^' ^e-i^S^.^(^(^)-^(^»'l(o,i)(e)l(o,oo)(a^), (16) 

where E is as in (|14p . 

We sample from this posterior distribution using the adaptive chain {(yn,in,Cn,On,o'n,Xn)}- 
The chain {{yn,'i'n,Cn)} is updated following Steps (l)-(3) of Algorithm 12.11 It is used to form 
the adaptive estimate of Z{6) as given by (^ (with {ymin} replacing {X„,I„}). These estimates 
are used to update {9n,o''^,Xn) using a Metropolis-within-Gibbs scheme. More specifically, given 
(7^,x„, we sample 6n+i with a Random Walk Metropolis with proposal U{9n — b,6n + b) (with 
reflexion at the boundaries) and target proportional to ^^ ^Z . Given 0„+i,x„, we generate cr^+i 
by sampling from the Inverse-Gamma distribution with parameters ( 2 ' 2 ^s&s ivi^) ~ ^i^)) )• 
And given {9n+i, Cn+i), we sample each x„+i(s) from its conditional distribution given {x(n), u 7^ 
s}. This conditional distribution is given by 






p{x (s) = a\x{u),u 7^ s) oc exp 9a 2J x{u) — --^(y(s) —a) , a € {—1, 1}. 

\ u~s / 
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/Bayesian computation for intractable normalizing constants 12 

Here u ^ v means that pixels u and v are neighbors. 

To test this algorithm, we generate a simulated dataset y according to model (jlSp with x 
generated from e^^^^' /Z{9) by perfect sampling. We use m = n = 64, 6 = 0.40 and a = 0.5. For 
the implementation details of the algorithm, we make exactly the same choices as in Example 
14.11 above. In particular we choose d = 100 and generate {0'*^} uniformly in (0,1). The results 
are given in Figure 2. Once again, the sample path obtained from {9n} clearly suggests that the 
distribution of 9n has converged to tt with a good mixing rate, as inferred from the autocorrelation 
plots. 



(a) 




(b) 



iiillL 



2000 4000 6000 8000 



0.38 0.39 0.40 0.41 



(<:) 




100 200 300 400 



«l) 




2000 4000 6000 8000 




00 200 300 400 



Figure 2: Output for the image segmentation model, (a)-(c): plots of {On}', (d)-(f): plots of {(T„}. 

4-3. Social network modeling 

We now give an application of the method to a Bayesian analysis of social networks. Statistical 
modeling of social network is a growing subject in social science (See e.g. [l6| and the references 
therein for more details). The set up is the following. We have n actors I = {1, . . . ,n}. For each 
pair {i,j) £ I x I, define yij = 1 if actor i has ties with actor j and yij = otherwise. In the 
example below, we only consider the case of a symmetric relationship where yij = yji for all 
i,j. The ties referred to here can be of various natures. For example, we might be interested in 
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modeling how friendships build up between co-workers or how research collaboration takes place 
between colleagues. Another interesting example from political science is modeling co-sponsorship 
ties (for a given piece of legislation) between members of a house of representatives or parliment. 
In this example we study the Medici business network dataset taken from [la] which describes 
the business ties between 16 Florentine families. Numbering arbitrarily the family from 1 to 16, 
we plot the observed social network in Figure 3. The dataset contains relatively few ties between 
families and even fewer transitive ties. 




Figure 3: Business Relationships between 16 Florentine families. 



One of the most popular models for social networks is the class of exponential random graph 
models. In these models, we assume that {yij} is a sample generated from the distribution 



P {y\Oi, ...,OK)(xexpiY^ OiSiiy) j 



(17) 



for some parameters 9i, ... ,9k'-, where SAjj) is a statistic used to capture some aspect of the 
network. For this example, and following [la|, we consider a 4-dimensional model with statistics 



Si{y) = ^ yij, the total number of ties, 

i<j 

S2{y) = 2_/ VikVjk^ the number of two-stars, 

i<j<k 

Ssiy) = X! ynyjiVki, the number of three-stars, 

i<j<k<l 
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Si{y) = ^ yikUjkyij, the number of transitive ties. 

i<j<k 

We assume a uniform prior distribution on D = (—50, 50)^ for 9 = {9i, ^2i ^3) ^4) and the posterior 



distribution writes: 



^ (^|y) « ^ ^^p (e ^fc'5fc(y)) ii5(^)- 



(18) 



We use Algorithm 12 . II to sample from (jlSp . For this example, we use 400 particles {0*-''} generated 
from a A^(0, 5) the normal distribution with mean and variance 5. We use the same parametriza- 
tion as in the previous examples to update {Xn,In,Cn)- For the adaptive chain {0n} we use a 
slightly different strategy. It turns out that some of the components of the target distribution vr 
are strongly related. Therefore we sample from vr in one block, using a Random Walk Metropolis 
with a Gaussian kernel A^(0, cr^S) (restricted to D) for some o" > and a positive definite matrix 
S. We adaptively set a so as to reach an acceptance rate of 30%. Ideally, we would like to choose 
E = Sjr the variance-covariance of tt which of course, is not available. We adaptively estimate S^r 
during the simulation as in [2,]. As before, we run (X„,/„,c„) until 7„ < 0.1001. Then we start 
{9n} and run the full chain (X^, In,Cn,9n) for a total of 25,000 iterations. The posterior distri- 
butions of the parameters are given in Figures 4a-4d. In Table 1, we give the sample posterior 
mean together with the 2.5% and 97.5% quantiles of the posterior distribution. Overall, these re- 
sults are consistent with the maximum likelihood estimates obtained by [l6|] using MCMC-MLE. 
The main difference appears in ^4 which we find here not to be significant. As a by-product, the 
sampler gives an estimate of the variance-covariance matrix of the posterior distribution vr: 



1.67 -0.41 0.27 -0.07 

-0.41 1.83 -0.47 -0.02 

0.27 -0.47 1.78 -0.03 

-0.07 -0.02 -0.03 1.65 



(19) 
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Parameters Post, mean Post, quantiles 

"el ^2l4 (-3.32,-0.81) 

6*2 0.94 (-0.43,2.49) 

6»3 -1.06 (-2.72,0.04) 

6»4 0.09 (-1.39,1.07) 



Table 1 
Summary of the posterior distribution of the parameters. Posterior means, 2.5% and 97.5% quantiles 



(a) 



(b) 



(c) 




~i 1 1 1 r 

5000 15000 25000 



I 1 1 1 1 

-4 -3 -2 -1 




Figure 4a: The adaptive MCMC output from ([T8|). (a)-(c): Plots for {Oi}. Based on 25,000 



iterations. 



(a) 



(b) 



(c) 




T 1 1 1 1 r 

5000 15000 25000 



I 1 1 1 1 

-10 12 3 




Figure 4b: The adaptive MCMC output from ([HD. (a)-(c): Plots for {6*2}. Based on 25,000 



iterations. 



(a) 



(b) 



(c) 




T 1 1 1 1 r 

5000 15000 25000 



A 

I 1 1 1 1 

-4 -3 -2 -1 




Figure 4c : The adaptive MCMC output from ([18]). (a)-(c): Plots for {6*3}. Based on 25,000 

iterations. 
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(a) 



(b) 



(c) 




~i 1 1 r 

5000 15000 25000 



iiiiiili^^BL 

I — I — I — I — I 

-3 -2-10 1 




Figure 4d : The adaptive MCMC output from ^. (a)-(c): Plots for {0^}. Based on 25,000 



iterations. 



5. Conclusion 

Sampling from posterior distributions with intractable normalizing constants is a difficult compu- 
tational problem. Thus far, all methods proposed in the literature but one entail approximations 
that do not vanish asymptotically. And the only exception ([I4]) requires exact sampling in the 
data space, which is only possible for very specific cases. In this work, we propose an approach 
that both is more general than [12] and satisfies a strong law of large numbers with limiting 
distribution equal to the target distribution. The few applications we have presented here sug- 
gest that the method is promising. It remains to be determined how the method will scale with 
the dimensionality and with the size of the problems, although in this respect, adaptations of 
the method involving annealing/tempering schemes are easy to imagine, which would allow large 
problems to be analysed properly. 
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6. Proof of Theorem [37T] 

Proof. Throughtout the proof, C will denote a finite constant but whose actual value can change 
rom one equation to the next. The convergence of the Wang-Landau algorithm has been studied in 
3]. It is shown in this work that under the condition of Theorem 13. 11 miuj J2'k^=i l|i|(^fc) = °o ^^^d 
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more importantly, e'^"-^'^' /Yl,j=i^ converges almost surely to a Z{9^^') (up to a multiplicative 
constant). 
Define 



w„(i) - ^^ 



oCri(i) 



Vn,i{^) 






Ek=lUlk) 

and 

^n(^) := ^#^^ = E ^n(^)^n,^(^)- (20) 

Instead of Z„, we work with Z„. This is equivalent because J2j=i e'^'"'^^' does not depend on Q and 
Z„ always appears in Qz^ ^^ ^ ratio. We have: 

infZ„(^)>e— ^^ (21) 

inf (i^] >e2(™-^^). (22) 

Combining (|22|) and (|12|) and part 2 of Remark 13. H we deduce that there exists eo > such that 
for all n,j > 



sup 

\h\<i 



Qihie) - TTz^h) <2il-eoy, Pr-a.s. (23) 



We introduce the notation Qn = Q^ ~ "^z ■ ^^ follows from (I23p that for any n > 1 the following 
function gn is well defined: 

oo 

9ni9) = T.Qihi0)- 
i=i 

Moreover \gn{0)\ < 2/eo for all 9 G Q. gn satisfies Poisson's equation for Q„ and h — tt^ {h): 

gn{0)-Qn9n{0) = h-7rz^{h). (24) 

Using this we can rewrite J2^=i ^{^k) — T^z^-ih) as: 

-| 71 1 " 1 " 

- E (^(^fc) - T^Zki^)) = - E (5'fc(6'fc) - Qkgk{Ok-i)) + - E {Qk9k{0k-i) - Qk-i9k-iiOk~i)) 

^ k=l ^ k=l ^ fc=l 

+ ^ (Q050(^0) - Qn<?n(0n)) • (25) 
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Since supgg0 \gn{0)\ < 2/eo, a similar bound hold for QnQn and we conclude that y^ (Qo5o(^o) — Qngn{On)) 
actually converges almost surely to as n ^ oo. Writing Dk = gk{Gk) — QkOki^k^i), it is eas- 
ily seen that {Di.,J^k} is a martingale difference with bounded increment and we deduce from 



martingales theory that - I]fc=i {gk{Qk) — Qkgk{&k-i)) converges almost surely to as 



n — > oo. 



Since Qz is a Metropolis kernel and using the fact that |min(l, ax) — min(l, ay)\ < a\x — y\ 
for all a, X, y > we deduce that for any function h : ^ M such that \h\ < 1, 

Znie) Zn-i{e) 



{Qz^-Qz^JKe)\ < J 



z„ 



^E{xo,e')-E{xo 



< 2e^'-'^ sup 
e,6i'ee 



Zn{(^) Zn-l{9) 



9')\h{e')-h{9)\de' 



Zn{0') 



z. 



n-l\ 



9') 



< C 



Zn{e) - Zn-ii9) , using m-^, 



(26) 



for some finite constant. Combining (|23l and [26l) we have the following well-known consequence: 
there exists C < oo such that for all n > 1: 



sup vr^ (h) - vr^ (h) < Csup Zn{9) - Zn-i{6) . 

\h\<i " " eee 



(27) 



The stability of Poisson's equation for geometrically ergodic transition kernels is well known (see 
e.g. [l|, l3|]). Combining (j23p . (|26p and ()27p . we can find a finite constant C such that for all k >\: 



\{Qkgk{Ok-i) - Qk-igk-i{Ok-i))\ < C sup Zk{0) - Zk-i{0) 



(28) 



Given the expression of Zn{0) in (|2Up it is not very hard to show there exists C < oo such that: 

1 ^ 



sup 



Zk{9)-ZkM9) <C[djk + 



0, 



(29) 



as /c — > oo as discussed above. It follows indeed that ^ J2k=i (^(^fc) ~ '^z^i^)) converges a.s. to 0. 
Given that Zn{0) — > CZ{9) almost surely for some finite constant C, 



^z„(/i) 



J^-S^h{e)de r-^^h{e)de 



Zn(e) 






de 



7r{h), 



as n ^ oo by Lebesgue's dominated convergence. 



n 
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